AITopics | vap model

Collaborating Authors

vap model

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A Noise-Robust Turn-Taking System for Real-World Dialogue Robots: A Field Experiment

Inoue, Koji, Okafuji, Yuki, Baba, Jun, Ohira, Yoshiki, Hyodo, Katsuya, Kawahara, Tatsuya

arXiv.org Artificial IntelligenceMar-8-2025

Turn-taking is a crucial aspect of human-robot interaction, directly influencing conversational fluidity and user engagement. While previous research has explored turn-taking models in controlled environments, their robustness in real-world settings remains underexplored. In this study, we propose a noise-robust voice activity projection (VAP) model, based on a Transformer architecture, to enhance real-time turn-taking in dialogue robots. To evaluate the effectiveness of the proposed system, we conducted a field experiment in a shopping mall, comparing the VAP system with a conventional cloud-based speech recognition system. Our analysis covered both subjective user evaluations and objective behavioral analysis. The results showed that the proposed system significantly reduced response latency, leading to a more natural conversation where both the robot and users responded faster. The subjective evaluations suggested that faster responses contribute to a better interaction experience.

interaction, robot, vap model, (13 more...)

arXiv.org Artificial Intelligence

2503.06241

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.05)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.05)
Asia > Japan > Shikoku > Ehime Prefecture > Matsuyama (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Applying General Turn-taking Models to Conversational Human-Robot Interaction

Skantze, Gabriel, Irfan, Bahar

arXiv.org Artificial IntelligenceJan-15-2025

Turn-taking is a fundamental aspect of conversation, but current Human-Robot Interaction (HRI) systems often rely on simplistic, silence-based models, leading to unnatural pauses and interruptions. This paper investigates, for the first time, the application of general turn-taking models, specifically TurnGPT and Voice Activity Projection (VAP), to improve conversational dynamics in HRI. These models are trained on human-human dialogue data using self-supervised learning objectives, without requiring domain-specific fine-tuning. We propose methods for using these models in tandem to predict when a robot should begin preparing responses, take turns, and handle potential interruptions. We evaluated the proposed system in a within-subject study against a traditional baseline system, using the Furhat robot with 39 adults in a conversational setting, in combination with a large language model for autonomous response generation. The results show that participants significantly prefer the proposed system, and it significantly reduces response delays and interruptions.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2501.08946

Country:

Europe (0.93)
North America > United States (0.28)
Asia > Japan (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Media > Film (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
(2 more...)

Add feedback

Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection

Inoue, Koji, Lala, Divesh, Skantze, Gabriel, Kawahara, Tatsuya

arXiv.org Artificial IntelligenceOct-21-2024

In human conversations, short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue. These backchannels signal attentiveness and understanding without interrupting the speaker, making their accurate prediction essential for creating more natural conversational agents. This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection (VAP) model. While existing approaches have relied on turn-based or artificially balanced datasets, our approach predicts both the timing and type of backchannels in a continuous and frame-wise manner on unbalanced, real-world datasets. We first pre-train the VAP model on a general dialogue corpus to capture conversational dynamics and then fine-tune it on a specialized dataset focused on backchannel behavior. Experimental results demonstrate that our model outperforms baseline methods in both timing and type prediction tasks, achieving robust performance in real-time environments. This research offers a promising step toward more responsive and human-like dialogue systems, with implications for interactive spoken dialogue applications such as virtual assistants and robots.

backchannel, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2410.15929

Country:

Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Sweden (0.04)
Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.68)

Add feedback

Multilingual Turn-taking Prediction Using Voice Activity Projection

Inoue, Koji, Jiang, Bing'er, Ekstedt, Erik, Kawahara, Tatsuya, Skantze, Gabriel

arXiv.org Artificial IntelligenceMar-14-2024

This paper investigates the application of voice activity projection (VAP), a predictive turn-taking model for spoken dialogue, on multilingual data, encompassing English, Mandarin, and Japanese. The VAP model continuously predicts the upcoming voice activities of participants in dyadic dialogue, leveraging a cross-attention Transformer to capture the dynamic interplay between participants. The results show that a monolingual VAP model trained on one language does not make good predictions when applied to other languages. However, a multilingual model, trained on all three languages, demonstrates predictive performance on par with monolingual models across all languages. Further analyses show that the multilingual model has learned to discern the language of the input signal. We also analyze the sensitivity to pitch, a prosodic cue that is thought to be important for turn-taking. Finally, we compare two different audio encoders, contrastive predictive coding (CPC) pre-trained on English, with a recent model based on multilingual wav2vec 2.0 (MMS).

dialogue, prediction, vap model, (15 more...)

arXiv.org Artificial Intelligence

2403.06487

Country:

Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Sweden (0.04)
Asia > Japan > Shikoku > Ehime Prefecture > Matsuyama (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection

Inoue, Koji, Jiang, Bing'er, Ekstedt, Erik, Kawahara, Tatsuya, Skantze, Gabriel

arXiv.org Artificial IntelligenceJan-9-2024

A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps dialogue stereo audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input context audio length and demonstrate that the proposed system can operate in real-time with CPU settings, with minimal performance degradation.

participant, real-time and continuous turn-taking prediction, transformer, (9 more...)

arXiv.org Artificial Intelligence

2401.04868

Country:

Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.06)
Europe > Sweden (0.05)
Asia > Japan > Shikoku > Ehime Prefecture > Matsuyama (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Architecture > Real Time Systems (0.86)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

A Comparative Analysis of Machine Learning Models for Early Detection of Hospital-Acquired Infections

Harvey, Ethan, Dong, Junzi, Ghosh, Erina, Samadani, Ali

arXiv.org Artificial IntelligenceNov-15-2023

As more and more infection-specific machine learning models are developed and planned for clinical deployment, simultaneously running predictions from different models may provide overlapping or even conflicting information. It is important to understand the concordance and behavior of parallel models in deployment. In this study, we focus on two models for the early detection of hospital-acquired infections (HAIs): 1) the Infection Risk Index (IRI) and 2) the Ventilator-Associated Pneumonia (VAP) prediction model. The IRI model was built to predict all HAIs, whereas the VAP model identifies patients at risk of developing ventilator-associated pneumonia. These models could make important improvements in patient outcomes and hospital management of infections through early detection of infections and in turn, enable early interventions. The two models vary in terms of infection label definition, cohort selection, and prediction schema. In this work, we present a comparative analysis between the two models to characterize concordances and confusions in predicting HAIs by these models. The learnings from this study will provide important findings for how to deploy multiple concurrent disease-specific models in the future.

comparative analysis, infection, prediction model, (12 more...)

arXiv.org Artificial Intelligence

2311.09329

Country:

North America > United States > Massachusetts > Middlesex County > Medford (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.48)

Industry:

Health & Medicine > Health Care Providers & Services (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.74)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.72)

Add feedback

What makes a good pause? Investigating the turn-holding effects of fillers

Jiang, Bing'er, Ekstedt, Erik, Skantze, Gabriel

arXiv.org Artificial IntelligenceMay-3-2023

Filled pauses (or fillers), such as "uh" and "um", are frequent in spontaneous speech and can serve as a turn-holding cue for the listener, indicating that the current speaker is not done yet. In this paper, we use the recently proposed Voice Activity Projection (VAP) model, which is a deep learning model trained to predict the dynamics of conversation, to analyse the effects of filled pauses on the expected turn-hold probability. The results show that, while filled pauses do indeed have a turn-holding effect, it is perhaps not as strong as could be expected, probably due to the redundancy of other cues. We also find that the prosodic properties and position of the filler has a significant effect on the turn-hold probability. However, contrary to what has been suggested in previous work, there is no difference between "uh" and "um" in this regard.

artificial intelligence, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2305.02101

Country:

North America > United States > Colorado > Boulder County > Boulder (0.14)
Europe > Sweden > Stockholm > Stockholm (0.04)
Europe > Portugal > Lisbon > Lisbon (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.49)

Add feedback